IN THE CLAIMS 

Claims 1, 6, 10, 12, 13, 14, 24, 28 and 36 are amended herein. Claims 2, 7, 
15, and 26 are cancelled. All pending claims are produced below. 



1. (Currently Amended) A system for finding compounds in a text 
corpus, comprising: 

a vocabulary comprising tokens extracted from a text corpus; and 
a compound finder configured to iteratively identify
compounds having a plurality of lengths within the text corpus,
each compound comprising a plurality of tokens, comprising:
an iterator configured to select n-grams having a same
length that is less than a length of n-grams selected
during a previous iteration;
an n-gram counter configured to evaluate a
frequency of occurrence for one or more n-grams
having the same length in the text corpus, each
n-gram comprising at least one token selected
from the vocabulary; and
a likelihood evaluator configured to determine
a likelihood of collocation for one or more of the
n-grams having the same length, adding a subset
of n-grams having a highest likelihood as
compounds to the vocabulary and rebuilding the
vocabulary based on the added compounds.
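
As an illustration only, the iteration recited in Claim 1 might be sketched as follows. The function names (`ngrams`, `merge`, `find_compounds`), the `min_count` and `top_k` parameters, and the use of raw frequency as the score are all assumptions made for the example; the claim itself does not fix any of them.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of length n, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def merge(tokens, compounds, n):
    """Rebuild the token stream, joining each selected n-gram into one token."""
    out, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if gram in compounds:
            out.append(" ".join(gram))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def find_compounds(tokens, max_len, score, top_k, min_count=2):
    """Hypothetical sketch: each pass uses a shorter n-gram length."""
    tokens = list(tokens)
    vocabulary = set(tokens)
    for n in range(max_len, 1, -1):
        counts = Counter(ngrams(tokens, n))            # frequency of occurrence
        candidates = [g for g, c in counts.items() if c >= min_count]
        candidates.sort(key=lambda g: score(g, counts), reverse=True)
        best = set(candidates[:top_k])                 # highest-scoring subset
        vocabulary.update(" ".join(g) for g in best)   # add compounds
        tokens = merge(tokens, best, n)                # rebuild with compounds
    return vocabulary, tokens

text = "new york city is big . i like new york city".split()
# Raw frequency stands in for the likelihood of collocation (an assumption).
vocab, rebuilt = find_compounds(text, max_len=3,
                                score=lambda g, counts: counts[g], top_k=1)
```

Running from the longest length down means a compound found in one pass is treated as a single token in later, shorter passes.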

2. (Cancelled) 

3. (Currently Amended) A system according to Claim 1, wherein 
only some of the subset of n-grams having a highest likelihood are added as 
compounds to the vocabulary. 

4. (Original) A system according to Claim 1, wherein the likelihood 
of collocation as a likelihood ratio λ is computed in accordance with the formula: 



where L(H_i) is a likelihood of observing H_i under an independence hypothesis, 
L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a 
pair of tokens. 

5. (Original) A system according to Claim 4, wherein the L(H_c) is 
determined, comprising dividing the n-gram into n-1 pairings of segments, 
calculating a likelihood of collocation for each pairing of segments, and selecting 
the maximum likelihood of collocation of the pairings as L(H_c). 
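
The split described in Claim 5 can be illustrated with a short sketch. The helper names `pairings` and `best_pairing_likelihood` are hypothetical, and `pair_likelihood` stands for whatever likelihood of collocation is computed for a two-segment pairing; it is passed in rather than defined here.

```python
def pairings(gram):
    """Split an n-gram into its n-1 (left segment, right segment) pairings."""
    return [(gram[:i], gram[i:]) for i in range(1, len(gram))]

def best_pairing_likelihood(gram, pair_likelihood):
    """Take the maximum likelihood of collocation over all pairings."""
    return max(pair_likelihood(left, right) for left, right in pairings(gram))

# Made-up likelihood values, used only to exercise the sketch.
toy_scores = {
    (("new",), ("york", "city")): 0.2,
    (("new", "york"), ("city",)): 0.7,
}
best = best_pairing_likelihood(("new", "york", "city"),
                               lambda left, right: toy_scores[(left, right)])
```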

6. (Currently Amended) A method for finding compounds in a text 
corpus, comprising: 

building a vocabulary comprising tokens extracted from a text corpus; 
and 

iteratively identifying compounds having a plurality of lengths within 
the text corpus, each compound comprising a plurality of 
tokens, comprising: 

selecting n-grams having a same length that is less than a 
length of n-grams selected during a previous 
iteration; 

evaluating a frequency of occurrence for one or more 
n-grams having the same length in the text corpus, 
each n-gram comprising at least one token 
selected from the vocabulary; 

determining a likelihood of collocation for one or more of 
the n-grams having the same length; and 

adding a subset of n-grams having a highest 
likelihood as compounds to the vocabulary and 
rebuilding the vocabulary based on the added 
compounds. 

7. (Cancelled) 

8. (Currently Amended) A method according to Claim 6, further 
comprising: 

adding only some of the subset of the n-grams having a highest 
likelihood as compounds to the vocabulary. 

9. (Original) A method according to Claim 6, further comprising: 
computing the likelihood of collocation as a likelihood ratio λ in accordance with 
the formula: 




where L(H_i) is a likelihood of observing H_i under an independence hypothesis, 
L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a 
pair of tokens. 

10. (Currently Amended) A method according to Claim 9, further 
comprising determining L(H_c), comprising: 

dividing the n-gram into n-1 pairings of segments; 
calculating a likelihood of collocation for each pairing of segments; 
and 

selecting the maximum likelihood of collocation of the pairings as 
L(H_c). 

11. (Original) A computer-readable storage medium holding code for 
performing the method according to Claim 6. 

12. (Currently Amended) An apparatus for finding compounds in a 
text corpus, comprising: 

means for building a vocabulary comprising tokens extracted from a 
text corpus; and 

means for iteratively identifying compounds having a plurality of 
lengths within the text corpus, each compound comprising a 
plurality of tokens, comprising: 

means for selecting n-grams having a same length that is 
less than a length of n-grams selected during a 
previous iteration; 
means for evaluating a frequency of occurrence for one or 
more n-grams having the same length in the text 
corpus, each n-gram comprising at least one token 
selected from the vocabulary; 
means for determining a likelihood of collocation for one 
or more of the n-grams having the same length; 
and 

means for adding a subset of the n-grams having a 
highest likelihood as compounds to the vocabulary 
and means for rebuilding the vocabulary based on 
the added compounds. 

13. (Currently Amended) A system for identifying compounds 
through iterative analysis of measure of association, comprising: 

an iterator initially specifying a limit on a number of tokens per 
compound for an iteration and decreasing the limit for a 
subsequent iteration; and 
a compound finder configured to iteratively evaluate 
compounds within a text corpus, comprising: 
an n-gram counter configured to determine a 
number of occurrences of one or more n-grams 
within the text corpus, each n-gram comprising 
a number of tokens up to the limit for the 
iteration, which are 
each at least in part provided in a vocabulary for the 
text corpus; 

a likelihood evaluator configured to identify at 
least one n-gram comprising a number of tokens 
equal to the limit for the iteration based on the 
number of occurrences and determining a measure 
of association between the tokens in the identified 
n-gram, and adding each identified n-gram with a 
sufficient measure of association to the vocabulary 
as a compound token and rebuilding the vocabulary 
based on the added compound tokens and adjusting 

14. (Currently Amended) A system according to Claim 13, further 
comprising: 

a stored upper limit on a number of identified n-grams; and 
a limiter identifying a number of n-grams up to the stored upper limit 
based on the number of occurrences. 
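
One possible reading of the limiter in Claim 14 is a cap on how many n-grams are carried forward, ranked by occurrence count. The sketch below assumes that reading; `limit_candidates` and its parameters are hypothetical names, not terms from the claims.

```python
from collections import Counter

def limit_candidates(tokens, n, upper_limit):
    """Return at most upper_limit n-grams, most frequent first."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return counts.most_common(upper_limit)

top = limit_candidates("a b a b a b c".split(), n=2, upper_limit=2)
```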

15. (Cancelled) 

16. (Original) A system according to Claim 13, wherein the measure 
of association between the tokens in the identified n-gram comprises a likelihood 
ratio λ. 

17. (Original) A system according to Claim 16, wherein the likelihood 
ratio λ is calculated in accordance with the formula: 

where L(H_i) is a likelihood of observing H_i under an independence hypothesis, 
L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a 
pair of tokens. 

18. (Original) A system according to Claim 17, wherein, for each pair 
of tokens, t_1, t_2, in the identified n-gram, the independence hypothesis comprises 
$P(t_2 \mid t_1) = p(t_2 \mid \bar{t}_1)$ and the collocation hypothesis comprises $P(t_2 \mid t_1) > p(t_2 \mid \bar{t}_1)$. 
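
For a concrete sense of how the two hypotheses in Claim 18 can be compared, the sketch below uses the binomial likelihood ratio commonly applied to collocation testing. Several things here are assumptions rather than claim language: the binomial model itself, the direction log λ = log L(H_i) - log L(H_c), the use of unconstrained maximum-likelihood estimates for the collocation hypothesis, and the names `log_binom` and `log_likelihood_ratio`.

```python
import math

def log_binom(k, n, p):
    """Log of p**k * (1-p)**(n-k); the binomial coefficient cancels in the ratio."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)

def log_likelihood_ratio(c1, c2, c12, n):
    """log(L(H_i) / L(H_c)) for a token pair (t_1, t_2).

    c1, c2: occurrences of t_1 and t_2; c12: occurrences of the pair;
    n: total tokens. Assumes 0 < c1 < n.
    """
    p = c2 / n                     # independence: one rate for t_2
    p1 = c12 / c1                  # rate of t_2 after t_1
    p2 = (c2 - c12) / (n - c1)     # rate of t_2 after anything else
    log_l_hi = log_binom(c12, c1, p) + log_binom(c2 - c12, n - c1, p)
    log_l_hc = log_binom(c12, c1, p1) + log_binom(c2 - c12, n - c1, p2)
    return log_l_hi - log_l_hc

independent = log_likelihood_ratio(c1=10, c2=10, c12=1, n=100)
associated = log_likelihood_ratio(c1=10, c2=10, c12=9, n=100)
```

Under these assumptions a value near zero means the pair looks independent, and a strongly negative value favours the collocation hypothesis.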

19. (Original) A system according to Claim 17, wherein the L(H_i) is 
computed for each pair of tokens, t_1, t_2, in the identified n-gram in accordance 
with the formula: 

$$\underset{L(H_i)}{\arg\max}\ \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

20. (Original) A system according to Claim 13, further comprising: 
an initial vocabulary comprising a plurality of tokens extracted from 

the text corpus. 

21. (Original) A system according to Claim 20, further comprising: 
a parser parsing the tokens from the text corpus. 

22. (Original) A system according to Claim 13, further comprising: 

a filter determining the number of occurrences of one or more n-grams 
within the text corpus for only unique n-grams. 

23. (Original) A system according to Claim 13, wherein each text 
corpus comprises a plurality of documents comprising one of a Web page, a news 
message and text. 

24. (Currently Amended) A method for identifying compounds 
through iterative analysis of measure of association, comprising: 

iteratively specifying a limit on a number of tokens per compound for 
an iteration and decreasing the limit for a subsequent iteration; 
and 
iteratively evaluating compounds within a text corpus, comprising: 
determining a number of occurrences of one or more 
n-grams within the text corpus, each n-gram 
comprising a number of tokens up to the limit 
for the iteration, which 
are each at least in part provided in a vocabulary for 
the text corpus; 
identifying at least one n-gram comprising a number of 
tokens equal to the limit for the iteration based on 
the number of occurrences and determining a 
measure of association between the tokens in the 
identified n-gram; and 
adding each identified n-gram with a sufficient measure of 
association to the vocabulary as a compound token 
and rebuilding the vocabulary based on the added 

25. (Original) A method according to Claim 24, further comprising: 
providing an upper limit on a number of identified n-grams; and 
identifying a number of n-grams up to the upper limit based on the 
number of occurrences. 



26. (Cancelled) 

27. (Original) A method according to Claim 24, wherein the measure 
of association between the tokens in the identified n-gram comprises a likelihood 
ratio λ. 

28. (Currently Amended) A method according to Claim 27, further 
comprising calculating the likelihood ratio λ in accordance with the formula: 

where L(H_i) is a likelihood of observing H_i under an independence hypothesis, 
L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a 
pair of tokens. 

29. (Original) A method according to Claim 28, wherein, for each pair 
of tokens, t_1, t_2, in the identified n-gram, the independence hypothesis comprises 
$P(t_2 \mid t_1) = p(t_2 \mid \bar{t}_1)$ and the collocation hypothesis comprises $P(t_2 \mid t_1) > p(t_2 \mid \bar{t}_1)$. 

30. (Original) A method according to Claim 28, further comprising: 
computing the L(H_i) for each pair of tokens, t_1, t_2, in the identified 
n-gram in accordance with the formula: 

$$\underset{L(H_i)}{\arg\max}\ \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

31. (Original) A method according to Claim 24, further comprising: 
constructing an initial vocabulary comprising a plurality of tokens 

extracted from the text corpus. 

32. (Original) A method according to Claim 31, further comprising: 
parsing the tokens from the text corpus. 

33. (Original) A method according to Claim 24, further comprising: 
determining the number of occurrences of one or more n-grams within 
the text corpus for only unique n-grams. 



34. (Original) A method according to Claim 24, wherein each text 
corpus comprises a plurality of documents comprising one of a Web page, a news 
message and text. 

35. (Original) A computer-readable storage medium holding code for 
performing the method according to Claim 24. 

36. (Currently Amended) An apparatus for identifying compounds 
through iterative analysis of measure of association, comprising: 

means for specifying a limit on a number of tokens per compound for 
an iteration and decreasing the limit for a subsequent iteration; 
and 

means for iteratively evaluating compounds within a text corpus, 
comprising: 

means for determining a number of occurrences of one or 
more n-grams within the text corpus, each n-gram 
comprising a number of tokens up to the limit 
for the iteration, which 
are each at least in part provided in a vocabulary for 
the text corpus; 

means for identifying at least one n-gram comprising a 
number of tokens equal to the limit for the iteration 
based on the number of occurrences and means for 
determining a measure of association between the 
tokens in the identified n-gram; and 

means for adding each identified n-gram with a sufficient 
measure of association to the vocabulary as a 
compound token and means for rebuilding the 
vocabulary based on the added compound tokens. 